Homologous over-extension: a challenge for iterative similarity searches
نویسندگان
چکیده
We have characterized a novel type of PSI-BLAST error, homologous over-extension (HOE), using embedded PFAM domain queries on searches against a reference library containing Pfam-annotated UniProt sequences and random synthetic sequences. PSI-BLAST makes two types of errors: alignments to non-homologous regions and HOE alignments that begin in a homologous region, but extend beyond the homology into neighboring sequence regions. When the neighboring sequence region contains a non-homologous domain, PSI-BLAST can incorporate the unrelated sequence into its position specific scoring matrix, which then finds non-homologous proteins with significant expectation values. HOE accounts for the largest fraction of the initial false positive (FP) errors, and the largest fraction of FPs at iteration 5. In searches against complete protein sequences, 5-9% of alignments at iteration 5 are non-homologous. HOE frequently begins in a partial protein domain; when partial domains are removed from the library, HOE errors decrease from 16 to 3% of weighted coverage (hard queries; 35-5% for sampled queries) and no-error searches increase from 2 to 58% weighed coverage (hard; 16-78% sampled). When HOE is reduced by not extending previously found sequences, PSI-BLAST specificity improves 4-8-fold, with little loss in sensitivity.
منابع مشابه
PSI-Search: iterative HOE-reduced profile SSEARCH searching
UNLABELLED Iterative similarity searches with PSI-BLAST position-specific score matrices (PSSMs) find many more homologs than single searches, but PSSMs can be contaminated when homologous alignments are extended into unrelated protein domains-homologous over-extension (HOE). PSI-Search combines an optimal Smith-Waterman local alignment sequence search, using SSEARCH, with the PSI-BLAST profile...
متن کاملRefProtDom: a protein database with improved domain boundaries and homology relationships
UNLABELLED RefProtDom provides a set of divergent query domains, originally selected from Pfam, and full-length proteins containing their homologous domains, with diverse architectures, for evaluating pair-wise and iterative sequence similarity searches. Pfam homology and domain boundary annotations in the target library were supplemented using local and semi-global searches, PSI-BLAST searches...
متن کاملSequencing and comparative analysis of flagellin genes fliA and fliB in bovine Clostridium chauvoei isolates
Background: Clostridium chauvoei is the etiological agent of blackleg as an endogenous infection in cattle. Flagella have been known to play a critical role in the protective immunity of animals to clostridial infections. C.chauvoei has two copies of fliC gene, namely fliA and fliB. OBJECTIVES: The aim of this study was the determination and nucleotide sequence analysis of both copies of fliC ...
متن کاملRapid similarity search of proteins using alignments of domain arrangements
MOTIVATION Homology search methods are dominated by the central paradigm that sequence similarity is a proxy for common ancestry and, by extension, functional similarity. For determining sequence similarity in proteins, most widely used methods use models of sequence evolution and compare amino-acid strings in search for conserved linear stretches. Probabilistic models or sequence profiles capt...
متن کاملTarget enhanced 2D similarity search by using explicit biological activity annotations and profiles
BACKGROUND The enriched biological activity information of compounds in large and freely-accessible chemical databases like the PubChem Bioassay Database has become a powerful research resource for the scientific research community. Currently, 2D fingerprint based conventional similarity search (CSS) is the most common widely used approach for database screening, but it does not typically incor...
متن کامل